What is cluster analysis?

A form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).

Distance between two observations

Distance vs Similarity

Distance=1Similarity

Calculate & plot the distance between two players

You’ve obtained the coordinates relative to the center of the field for two players in a soccer match and would like to calculate the distance between them.

In this exercise you will plot the positions of the 2 players and manually calculate the distance between them by using the Euclidean distance formula.

x <- c(5, 15)
y <- c(4, 10)
two_players <- data.frame(x,y)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Plot the positions of the players
ggplot(two_players, aes(x = x, y = y)) + 
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30,30), y = c(-20, 20))

# Split the players data frame into two observations 
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619

Excellent work! Using the formula is a great way to learn how distance is measured between two observations.

Using the dist() function

Using the Euclidean formula manually may be practical for 2 observations but can get more complicated rather quickly when measuring the distance between many observations.

The dist() function simplifies this process by calculating distances between our observations (rows) using their features (columns). In this case the observations are the player positions and the dimensions are their x and y coordinates.

Note: The default distance calculation for the dist() function is Euclidean distance

three_players <- rbind(two_players, c(0, 20))

# Calculate the Distance Between two_players
dist_two_players <- dist(two_players)
dist_two_players
##         1
## 2 11.6619
# Calculate the Distance Between three_players
dist_three_players <- dist(three_players)
dist_three_players
##          1        2
## 2 11.66190         
## 3 16.76305 18.02776

The dist() function makes life easier when working with many dimensions and observations.

The importance of scale

Scaling our features

heightscaled=heightmean(height)sd(height)

Effects of scale

You have learned that when a variable is on a larger scale than other variables in your data it may disproportionately influence the resulting distance calculated between your observations. Lets see this in action by observing a sample of data from the trees data set.

You will leverage the scale() function which by default centers & scales our column features.

Our variables are the following:

Girth <- c(8.3, 8.6, 10.5)
Height <- c(840, 780, 864)
three_trees <- data.frame(Girth, Height)

# Calculate distance for three_trees 
dist_trees <- dist(three_trees)

# Scale three trees & calculate the distance  
scaled_three_trees <- scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)

# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
##          1        2
## 2 60.00075         
## 3 24.10062 84.02149
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
##          1        2
## 2 1.409365         
## 3 1.925659 2.511082

Notice that before scaling observations 1 & 3 were the closest but after scaling observations 1 & 2 turn out to have the smallest distance.

Measuring distance for categorical data

Jaccard index = Similarity

= intersection of A, B = cases where both are true = union between A, B = cases where either is ever true

Calculating distance between categorical variables

In this exercise you will explore how to calculate binary (Jaccard) distances. In order to calculate distances we will first have to dummify our categories using the dummy.data.frame() from the library dummies

You will use a small collection of survey observations stored in the data frame job_survey with the following columns:

library(dummies)
## dummies-1.5.6 provided by Decision Patterns
job_satisfaction <- as.factor(c("Low", "Low", "High", "Low", "Mid"))
is_happy <- as.factor(c("No", "No", "Yes", "No", "No"))
job_survey <- data.frame(job_satisfaction, is_happy)

# Dummify the Survey Data
(dummy_survey <- dummy.data.frame(job_survey))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
##   job_satisfactionHigh job_satisfactionLow job_satisfactionMid is_happyNo
## 1                    0                   1                   0          1
## 2                    0                   1                   0          1
## 3                    1                   0                   0          0
## 4                    0                   1                   0          1
## 5                    0                   0                   1          1
##   is_happyYes
## 1           0
## 2           0
## 3           1
## 4           0
## 5           0
# Calculate the Distance
dist_survey <- dist(dummy_survey, method = 'binary')

# Print the Original Data
job_survey
##   job_satisfaction is_happy
## 1              Low       No
## 2              Low       No
## 3             High      Yes
## 4              Low       No
## 5              Mid       No
# Print the Distance Matrix
dist_survey
##           1         2         3         4
## 2 0.0000000                              
## 3 1.0000000 1.0000000                    
## 4 0.0000000 0.0000000 1.0000000          
## 5 0.6666667 0.6666667 1.0000000 0.6666667

Notice that this distance metric successfully captured that observations 1 and 2 are identical (distance of 0)

Comparing more than two observations

Hierarchical clustering

Complete Linkage: maxomum distance between two sets

Linkage criteria

complete linkage: maximum distance between two sets single linkage: minimum distance between two sets average linkage: average distance between two sets

Calculating linkage

Let us revisit the example with three players on a field. The distance matrix between these three players is shown below and is available as the variable dist_players.

From this we can tell that the first group that forms is between players 1 & 2, since they are the closest to one another with a Euclidean distance value of 11.

Now you want to apply the three linkage methods you have learned to determine what the distance of this group is to player 3.

1 2
2 11
3 16 18
dist_players <- dist(three_players)

# Extract the pair distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]

# Calculate the complete distance between group 1-2 and 3
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 18.02776
# Calculate the single distance between group 1-2 and 3
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 16.76305
# Calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 17.39541

The choice of the linkage method can drastically change the result.

Capturing K clusters

Assign cluster membership

In this exercise you will leverage the hclust() function to calculate the iterative linkage steps and you will use the cutree() function to extract the cluster assignments for the desired number (k) of clusters.

You are given the positions of 12 players at the start of a 6v6 soccer match. This is stored in the lineup data frame.

You know that this match has two teams (k = 2), let’s use the clustering methods you learned to assign which team each player belongs in based on their position.

Notes:

x <- c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26)
y <- c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
(lineup <- tibble(x, y))
## # A tibble: 12 x 2
##        x     y
##    <dbl> <dbl>
##  1    -1     1
##  2    -2    -3
##  3     8     6
##  4     7    -8
##  5   -12     8
##  6   -15     0
##  7   -13   -10
##  8    15    16
##  9    21     2
## 10    12   -15
## 11   -25     1
## 12    26     0
# Calculate the Distance
dist_players <- dist(lineup)

# Perform the hierarchical clustering using the complete linkage
hc_players <- hclust(dist_players, method = "complete")

# Calculate the assignment vector with a k of 2
clusters_k2 <- cutree(hc_players, k = 2)

# Create a new data frame storing these results
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)

Fantastic job! In the next exercise we will explore this result.

Exploring the clusters

Because clustering analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.

In this exercise you will explore that data frame you created in the previous exercise lineup_k2_complete.

Reminder: The lineup_k2_complete data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game to which you have added clustering assignments based on the following parameters:

# Count the cluster assignments
count(lineup_k2_complete, cluster)
## # A tibble: 2 x 2
##   cluster     n
##     <int> <int>
## 1       1     6
## 2       2     6
# Plot the positions of the players and color them using their cluster
ggplot(lineup_k2_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

You’re doing great! Think carefully about whether these results make sense to you and why.

Visualizing the dendrogram

Comparing average, single & complete linkage

You are now ready to analyze the clustering results of the lineup dataset using the dendrogram plot. This will give you a new perspective on the effect the decision of the linkage method has on your resulting cluster analysis.

# Prepare the Distance Matrix
dist_players <- dist(lineup)

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage') #maximum euclidian distance <= height of the branch
plot(hc_single, main = 'Single Linkage')     #minimum euclidian distance <= height of the branch
plot(hc_average, main = 'Average Linkage')   #average euclidian distance <= height of the branch

Excellent! Did you notice how the trees all look different? In the coming exercises you will see how visualizing this structure can be helpful for building clusters.

Cutting the tree

all members of the clusters shown here will have a euclidian distance no greater than the cut height of 15.

Clusters based on height

In previous exercises you have grouped your observations into clusters using a pre-defined number of clusters (k). In this exercise you will leverage the visual representation of the dendrogram in order to group your observations into clusters using a maximum height (h), below which clusters form.

You will work the color_branches() function from the dendextend library in order to visually inspect the clusters that form at any height along the dendrogram.

The hc_players has been carried over from your previous work with the soccer line-up data.

library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h = 40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)

Excellent! Can you see that the height that you use to cut the tree greatly influences the number of clusters and their size? Consider taking a moment to play with other values of height before continuing.

Exploring the branches cut from the tree

The cutree() function you used in exercises 5 & 6 can also be used to cut a tree at a given height by using the h parameter. Take a moment to explore the clusters you have generated from the previous exercises based on the heights 20 & 40.

dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Calculate the assignment vector with a h of 20
clusters_h20 <- cutree(hc_players, h = 20)

# Create a new data frame storing these results
lineup_h20_complete <- mutate(lineup, cluster = clusters_h20)

# Calculate the assignment vector with a h of 40
clusters_h40 <- cutree(hc_players, h = 40)

# Create a new data frame storing these results
lineup_h40_complete <- mutate(lineup, cluster = clusters_h40)

# Plot the positions of the players and color them using their cluster for height = 20
ggplot(lineup_h20_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

# Plot the positions of the players and color them using their cluster for height = 40
ggplot(lineup_h40_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Great job! You can now explore your clusters using both k and h parameters.

The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance). While the members of the clusters that form below a desired height have a maximum linkage+distance amongst themselves that is less than the desired height.

Making sense of the clusters

Exploring more than 2 dimensions

Segment wholesale customers

You’re now ready to use hierarchical clustering to perform market segmentation (i.e. use consumer characteristics to group them into subgroups).

In this exercise you are provided with the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. Assign these clients into meaningful clusters.

Note: For this exercise you can assume that because the data is all of the same type (amount spent) and you will not need to scale it.

customers_spend <- readRDS("_data/ws_customers.rds")

# Calculate Euclidean distance between customers
dist_customers <- dist(customers_spend)

# Generate a complete linkage analysis 
hc_customers <- hclust(dist_customers, method = "complete")

# Plot the dendrogram
plot(hc_customers)

# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)

# Generate the segmented customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)

Excellent! Let’s move on to the next exercise and explore these clusters.

Explore wholesale customer clusters

Continuing your work on the wholesale dataset you are now ready to analyze the characteristics of these clusters.

Since you are working with more than 2 dimensions it would be challenging to visualize a scatter plot of the clusters, instead you will rely on summary statistics to explore these clusters. In this exercise you will analyze the mean amount spent in each cluster for all three categories.

dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers)
clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)

# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
##   cluster  n
## 1       1  5
## 2       2 29
## 3       3  5
## 4       4  6
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)

# Plot the colored dendrogram
plot(dend_colored)

# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
## # A tibble: 4 x 4
##   cluster   Milk Grocery Frozen
##     <int>  <dbl>   <dbl>  <dbl>
## 1       1 16950   12891.   991.
## 2       2  2513.   5229.  1796.
## 3       3 10452.  22551.  1355.
## 4       4  1250.   3917. 10889.

Great work! You’ve gathered a bunch of information about these clusters, now let’s see what can be interpreted from them.

  1. Customers in cluster 1 spent more money on Milk than any other cluster.
  2. Customers in cluster 3 spent more money on Grocery than any other cluster.
  3. Customers in cluster 4 spent more money on Frozen goods than any other cluster.
  4. The majority of customers fell into cluster 2 and did not show any excessive spending in any category.

All 4 statements are reasonable, but whether they are meaningful depends heavily on the business context of the clustering.

Introduction to K-means

K-means on a soccer field

In the previous chapter you used the lineup dataset to learn about hierarchical clustering, in this chapter you will use the same data to learn about k-means clustering. As a reminder, the lineup data frame contains the positions of 12 players at the start of a 6v6 soccer match.

Just like before, you know that this match has two teams on the field so you can perform a k-means analysis using k = 2 in order to determine which player belongs to which team.

Note that in the kmeans() function k is specified using the centers parameter.

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Well done! Knowing the desired number of clusters ahead of time can be very helpful when performing a k-means analysis. In the next section we will see what happens when we use an incorrect value of k.

K-means on a soccer field (part 2)

In the previous exercise you successfully used the k-means algorithm to cluster the two teams from the lineup data frame. This time, let’s explore what happens when you use a k of 3.

You will see that the algorithm will still run, but does it actually make sense in this context…

# Build a kmeans model
model_km3 <- kmeans(lineup, centers = 3)

# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster

# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster = clust_km3)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Does this result make sense? Remember we only have 2 teams on the field. It’s very important to remember that k-means will run with any k that is more than 2 and less than your total observations, but it doesn’t always mean the results will be meaningful.

Evaluating different values of K by eye

Many K’s many models

While the lineup dataset clearly has a known value of k, often times the optimal number of clusters isn’t known and must be estimated.

In this exercise you will leverage map_dbl() from the purrr library to run k-means using values of k ranging from 1 to 10 and extract the total within-cluster sum of squares metric from each one. This will be the first step towards visualizing the elbow plot.

library(purrr)

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
(elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
))
##     k tot_withinss
## 1   1    3489.9167
## 2   2    1434.5000
## 3   3     940.8333
## 4   4     637.2500
## 5   5     406.0000
## 6   6     265.1667
## 7   7     171.0000
## 8   8     134.5000
## 9   9      82.0000
## 10 10      23.0000

Great work! In the next exercise you will plot the elbow plot from this data.

Elbow (Scree) plot

In the previous exercises you have calculated the total within-cluster sum of squares for values of k ranging from 1 to 10. You can visualize this relationship using a line plot to create what is known as an elbow plot (or scree plot).

When looking at an elbow plot you want to see a sharp decline from one k to another followed by a more gradual decrease in slope. The last value of k before the slope of the plot levels off suggests a “good” value of k.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

Fantastic! You have learned how to create and visualize elbow plots as a tool for finding a “good” value of k. In the next section you will add another tool to your arsenal for finding k.

You can see that there is a sharp change in the slope of this line at k = 2 that makes an “elbow” shape. Furthermore, this is supported by the prior knowledge that there are two teams in this data and a k of 2 is desired.

Silhouette analysis: observation level performance

Silhouette analysis

Silhouette analysis allows you to calculate how similar each observations is with the cluster it is assigned relative to other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:

In this exercise you will leverage the pam() and the silhouette() functions from the cluster library to perform silhouette analysis to compare the results of models with a k of 2 and a k of 3. You’ll continue working with the lineup dataset.

Pay close attention to the silhouette plot, does each observation clearly belong to its assigned cluster for k = 3?

library(cluster)

# Generate a k-means model using the pam() function with a k = 2
pam_k2 <- pam(lineup, k = 2)

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))

# Generate a k-means model using the pam() function with a k = 3
pam_k3 <- pam(lineup, k = 3)

# Plot the silhouette visual for the pam_k3 model
plot(silhouette(pam_k3))

Great work! Did you notice that for k = 2, no observation has a silhouette width close to 0? What about the fact that for k = 3, observation 3 is close to 0 and is negative? This suggests that k = 3 is not the right number of clusters.

Making sense of the K-means clusters

Segmenting with K-means

Revisiting wholesale data: “Best” k

At the end of Chapter 2 you explored wholesale distributor data customers_spend using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.

The first step will be to determine the “best” value of k using average silhouette width.

A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery & Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that because the data is all of the same type (amount spent) and you will not need to scale it.

# Use map_dbl to run many models with varying value of k
(sil_width <- map_dbl(2:10,  function(k){
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width
}))
## [1] 0.5842282 0.3741916 0.4225053 0.4397891 0.3609199 0.3270878 0.3263239
## [8] 0.3319012 0.3255676
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

You’re doing great! From the plot I hope you noticed that k = 2 has the highest average sillhouette width and is the “best” value of k we will move forward with.

Revisiting wholesale data: Exploration

From the previous analysis you have found that k = 2 has the highest average silhouette width. In this exercise you will continue to analyze the wholesale customer data by building and exploring a kmeans model with 2 clusters.

set.seed(42)

# Build a k-means model for the customers_spend with a k of 2
model_customers <- kmeans(customers_spend, centers = 2)

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

# Build the segment_customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)

# Calculate the size of each cluster
count(segment_customers, cluster)
##   cluster  n
## 1       1 35
## 2       2 10
# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
## # A tibble: 2 x 4
##   cluster   Milk Grocery Frozen
##     <int>  <dbl>   <dbl>  <dbl>
## 1       1  2296.    5004  3354.
## 2       2 13701.   17721  1173

Well done! It seems that in this case cluster 1 consists of individuals who proportionally spend more on Frozen food while cluster 2 customers spent more on Milk and Grocery. Did you notice that when you explored this data using hierarchical clustering, the method resulted in 4 clusters while using k-means got you 2. Both of these results are valid, but which one is appropriate for this would require more subject matter expertise. Before you proceed with the next chapter, remember that: Generating clusters is a science, but interpreting them is an art.

Occupational wage data

Next steps: hierarchical clustering

Initial exploration of the data

You are presented with data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. You would like to use this data to identify clusters of occupations that maintained similar income trends.

The data is stored in your environment as the data.matrix oes.

Before you begin to cluster this data you should determine whether any pre-processing steps (such as scaling and imputation) are necessary.

Leverage the functions head() and summary() to explore the oes data in order to determine which of the pre-processing steps below are necessary:

There are no missing values, no categorical and the features are on the same scale.

oes <- readRDS("_data/oes.rds")

# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = 'euclidean')

# Generate an average linkage analysis 
hc_oes <- hclust(dist_oes, method = 'average')

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

# Plot the dendrogram
plot(dend_oes)

# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)

Well done! Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let’s continue this exploration.

Hierarchical clustering: Preparing for exploration

You have now created a potential clustering for the oes data, before you can explore these clusters with ggplot2 you will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.

dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')

library(tibble)
library(tidyr)

# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

# Generate the segmented the oes data frame
(clust_oes <- mutate(df_oes, cluster = cut_oes))
##                    occupation  2001  2002  2003  2004  2005  2006  2007   2008
## 1                  Management 70800 78870 83400 87090 88450 91930 96150 100310
## 2         Business Operations 50580 53350 56000 57120 57930 60000 62410  64720
## 3            Computer Science 60350 61630 64150 66370 67100 69240 72190  74500
## 4    Architecture/Engineering 56330 58020 60390 63060 63910 66190 68880  71430
## 5   Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020  64280
## 6          Community Services 34190 34630 35800 37050 37530 39000 40540  41790
## 7                       Legal 69030 77330 78590 81180 81070 85360 88450  92270
## 8  Education/Training/Library 39130 40160 41390 42810 43450 45320 46610  48460
## 9   Arts/Design/Entertainment 39770 41660 43350 43820 44310 46110 48410  50670
## 10   Healthcare Practitioners 49930 53990 56240 58310 59170 62030 65020  67890
## 11         Healthcare Support 21900 22410 22960 23510 23850 24610 25600  26340
## 12         Protective Service 32530 33330 34430 35240 35750 37040 38750  40200
## 13           Food Preparation 16720 17180 17400 17620 17840 18430 19440  20220
## 14  Grounds Cleaning & Maint. 20380 20850 21290 21670 21930 22580 23560  24370
## 15              Personal Care 21010 21370 21570 22080 22180 22920 23980  24120
## 16                      Sales 28920 30610 31560 32280 32800 34350 35240  36080
## 17      Office Administrative 27230 27910 28540 29390 29710 30370 31200  32220
## 18   Farming/Fishing/Forestry 19630 20220 20290 20670 21010 21810 22640  23560
## 19               Construction 35450 36340 37000 37890 38260 39290 40620  42350
## 20 Installation/Repair/Maint. 34960 35780 36560 37620 38050 39060 39930  41230
## 21                 Production 27600 28190 28930 29480 29890 30480 31310  32320
## 22      Transportation/Moving 26560 27220 27630 28250 28820 29460 30680  31450
##      2010   2011   2012   2013   2014   2015   2016 cluster
## 1  105440 107410 108570 110550 112490 115020 118020       1
## 2   67690  68740  69550  71020  72410  73800  75070       2
## 3   77230  78730  80180  82010  83970  86170  87880       2
## 4   75550  77120  79000  80100  81520  82980  84300       2
## 5   66390  67470  68360  69400  70070  71220  72930       2
## 6   43180  43830  44240  44710  45310  46160  47200       3
## 7   96940  98380  98570  99620 101110 103460 105980       1
## 8   50440  50870  51210  51500  52210  53000  54520       3
## 9   52290  53850  54490  55580  55790  56980  58390       3
## 10  71280  72730  73540  74740  76010  77800  79160       2
## 11  26920  27370  27780  28300  28820  29520  30470       3
## 12  42490  42730  43050  43510  43980  44610  45810       3
## 13  21240  21430  21380  21580  21980  22850  23850       3
## 14  25300  25560  25670  26010  26370  27080  28010       3
## 15  24590  24620  24550  24710  24980  25650  26510       3
## 16  36790  37520  37990  38200  38660  39320  40560       3
## 17  33470  34120  34410  34900  35530  36330  37260       3
## 18  24330  24300  24230  24330  25160  26360  27810       3
## 19  43870  44630  44960  45630  46600  47580  48900       3
## 20  42810  43390  43870  44420  45220  45990  46690       3
## 21  33770  34220  34500  34930  35490  36220  37190       3
## 22  32660  33200  33590  33860  34460  35160  36070       3
# Create a tidy data frame by gathering the year and values into two columns
(gathered_oes <- gather(data = clust_oes, 
                       key = year, 
                       value = mean_salary, 
                       -occupation, -cluster))
##                     occupation cluster year mean_salary
## 1                   Management       1 2001       70800
## 2          Business Operations       2 2001       50580
## 3             Computer Science       2 2001       60350
## 4     Architecture/Engineering       2 2001       56330
## 5    Life/Physical/Social Sci.       2 2001       49710
## 6           Community Services       3 2001       34190
## 7                        Legal       1 2001       69030
## 8   Education/Training/Library       3 2001       39130
## 9    Arts/Design/Entertainment       3 2001       39770
## 10    Healthcare Practitioners       2 2001       49930
## 11          Healthcare Support       3 2001       21900
## 12          Protective Service       3 2001       32530
## 13            Food Preparation       3 2001       16720
## 14   Grounds Cleaning & Maint.       3 2001       20380
## 15               Personal Care       3 2001       21010
## 16                       Sales       3 2001       28920
## 17       Office Administrative       3 2001       27230
## 18    Farming/Fishing/Forestry       3 2001       19630
## 19                Construction       3 2001       35450
## 20  Installation/Repair/Maint.       3 2001       34960
## 21                  Production       3 2001       27600
## 22       Transportation/Moving       3 2001       26560
## 23                  Management       1 2002       78870
## 24         Business Operations       2 2002       53350
## 25            Computer Science       2 2002       61630
## 26    Architecture/Engineering       2 2002       58020
## 27   Life/Physical/Social Sci.       2 2002       52380
## 28          Community Services       3 2002       34630
## 29                       Legal       1 2002       77330
## 30  Education/Training/Library       3 2002       40160
## 31   Arts/Design/Entertainment       3 2002       41660
## 32    Healthcare Practitioners       2 2002       53990
## 33          Healthcare Support       3 2002       22410
## 34          Protective Service       3 2002       33330
## 35            Food Preparation       3 2002       17180
## 36   Grounds Cleaning & Maint.       3 2002       20850
## 37               Personal Care       3 2002       21370
## 38                       Sales       3 2002       30610
## 39       Office Administrative       3 2002       27910
## 40    Farming/Fishing/Forestry       3 2002       20220
## 41                Construction       3 2002       36340
## 42  Installation/Repair/Maint.       3 2002       35780
## 43                  Production       3 2002       28190
## 44       Transportation/Moving       3 2002       27220
## 45                  Management       1 2003       83400
## 46         Business Operations       2 2003       56000
## 47            Computer Science       2 2003       64150
## 48    Architecture/Engineering       2 2003       60390
## 49   Life/Physical/Social Sci.       2 2003       54930
## 50          Community Services       3 2003       35800
## 51                       Legal       1 2003       78590
## 52  Education/Training/Library       3 2003       41390
## 53   Arts/Design/Entertainment       3 2003       43350
## 54    Healthcare Practitioners       2 2003       56240
## 55          Healthcare Support       3 2003       22960
## 56          Protective Service       3 2003       34430
## 57            Food Preparation       3 2003       17400
## 58   Grounds Cleaning & Maint.       3 2003       21290
## 59               Personal Care       3 2003       21570
## 60                       Sales       3 2003       31560
## 61       Office Administrative       3 2003       28540
## 62    Farming/Fishing/Forestry       3 2003       20290
## 63                Construction       3 2003       37000
## 64  Installation/Repair/Maint.       3 2003       36560
## 65                  Production       3 2003       28930
## 66       Transportation/Moving       3 2003       27630
## 67                  Management       1 2004       87090
## 68         Business Operations       2 2004       57120
## 69            Computer Science       2 2004       66370
## 70    Architecture/Engineering       2 2004       63060
## 71   Life/Physical/Social Sci.       2 2004       57550
## 72          Community Services       3 2004       37050
## 73                       Legal       1 2004       81180
## 74  Education/Training/Library       3 2004       42810
## 75   Arts/Design/Entertainment       3 2004       43820
## 76    Healthcare Practitioners       2 2004       58310
## 77          Healthcare Support       3 2004       23510
## 78          Protective Service       3 2004       35240
## 79            Food Preparation       3 2004       17620
## 80   Grounds Cleaning & Maint.       3 2004       21670
## 81               Personal Care       3 2004       22080
## 82                       Sales       3 2004       32280
## 83       Office Administrative       3 2004       29390
## 84    Farming/Fishing/Forestry       3 2004       20670
## 85                Construction       3 2004       37890
## 86  Installation/Repair/Maint.       3 2004       37620
## 87                  Production       3 2004       29480
## 88       Transportation/Moving       3 2004       28250
## 89                  Management       1 2005       88450
## 90         Business Operations       2 2005       57930
## 91            Computer Science       2 2005       67100
## 92    Architecture/Engineering       2 2005       63910
## 93   Life/Physical/Social Sci.       2 2005       58030
## 94          Community Services       3 2005       37530
## 95                       Legal       1 2005       81070
## 96  Education/Training/Library       3 2005       43450
## 97   Arts/Design/Entertainment       3 2005       44310
## 98    Healthcare Practitioners       2 2005       59170
## 99          Healthcare Support       3 2005       23850
## 100         Protective Service       3 2005       35750
## 101           Food Preparation       3 2005       17840
## 102  Grounds Cleaning & Maint.       3 2005       21930
## 103              Personal Care       3 2005       22180
## 104                      Sales       3 2005       32800
## 105      Office Administrative       3 2005       29710
## 106   Farming/Fishing/Forestry       3 2005       21010
## 107               Construction       3 2005       38260
## 108 Installation/Repair/Maint.       3 2005       38050
## 109                 Production       3 2005       29890
## 110      Transportation/Moving       3 2005       28820
## 111                 Management       1 2006       91930
## 112        Business Operations       2 2006       60000
## 113           Computer Science       2 2006       69240
## 114   Architecture/Engineering       2 2006       66190
## 115  Life/Physical/Social Sci.       2 2006       59660
## 116         Community Services       3 2006       39000
## 117                      Legal       1 2006       85360
## 118 Education/Training/Library       3 2006       45320
## 119  Arts/Design/Entertainment       3 2006       46110
## 120   Healthcare Practitioners       2 2006       62030
## 121         Healthcare Support       3 2006       24610
## 122         Protective Service       3 2006       37040
## 123           Food Preparation       3 2006       18430
## 124  Grounds Cleaning & Maint.       3 2006       22580
## 125              Personal Care       3 2006       22920
## 126                      Sales       3 2006       34350
## 127      Office Administrative       3 2006       30370
## 128   Farming/Fishing/Forestry       3 2006       21810
## 129               Construction       3 2006       39290
## 130 Installation/Repair/Maint.       3 2006       39060
## 131                 Production       3 2006       30480
## 132      Transportation/Moving       3 2006       29460
## 133                 Management       1 2007       96150
## 134        Business Operations       2 2007       62410
## 135           Computer Science       2 2007       72190
## 136   Architecture/Engineering       2 2007       68880
## 137  Life/Physical/Social Sci.       2 2007       62020
## 138         Community Services       3 2007       40540
## 139                      Legal       1 2007       88450
## 140 Education/Training/Library       3 2007       46610
## 141  Arts/Design/Entertainment       3 2007       48410
## 142   Healthcare Practitioners       2 2007       65020
## 143         Healthcare Support       3 2007       25600
## 144         Protective Service       3 2007       38750
## 145           Food Preparation       3 2007       19440
## 146  Grounds Cleaning & Maint.       3 2007       23560
## 147              Personal Care       3 2007       23980
## 148                      Sales       3 2007       35240
## 149      Office Administrative       3 2007       31200
## 150   Farming/Fishing/Forestry       3 2007       22640
## 151               Construction       3 2007       40620
## 152 Installation/Repair/Maint.       3 2007       39930
## 153                 Production       3 2007       31310
## 154      Transportation/Moving       3 2007       30680
## 155                 Management       1 2008      100310
## 156        Business Operations       2 2008       64720
## 157           Computer Science       2 2008       74500
## 158   Architecture/Engineering       2 2008       71430
## 159  Life/Physical/Social Sci.       2 2008       64280
## 160         Community Services       3 2008       41790
## 161                      Legal       1 2008       92270
## 162 Education/Training/Library       3 2008       48460
## 163  Arts/Design/Entertainment       3 2008       50670
## 164   Healthcare Practitioners       2 2008       67890
## 165         Healthcare Support       3 2008       26340
## 166         Protective Service       3 2008       40200
## 167           Food Preparation       3 2008       20220
## 168  Grounds Cleaning & Maint.       3 2008       24370
## 169              Personal Care       3 2008       24120
## 170                      Sales       3 2008       36080
## 171      Office Administrative       3 2008       32220
## 172   Farming/Fishing/Forestry       3 2008       23560
## 173               Construction       3 2008       42350
## 174 Installation/Repair/Maint.       3 2008       41230
## 175                 Production       3 2008       32320
## 176      Transportation/Moving       3 2008       31450
## 177                 Management       1 2010      105440
## 178        Business Operations       2 2010       67690
## 179           Computer Science       2 2010       77230
## 180   Architecture/Engineering       2 2010       75550
## 181  Life/Physical/Social Sci.       2 2010       66390
## 182         Community Services       3 2010       43180
## 183                      Legal       1 2010       96940
## 184 Education/Training/Library       3 2010       50440
## 185  Arts/Design/Entertainment       3 2010       52290
## 186   Healthcare Practitioners       2 2010       71280
## 187         Healthcare Support       3 2010       26920
## 188         Protective Service       3 2010       42490
## 189           Food Preparation       3 2010       21240
## 190  Grounds Cleaning & Maint.       3 2010       25300
## 191              Personal Care       3 2010       24590
## 192                      Sales       3 2010       36790
## 193      Office Administrative       3 2010       33470
## 194   Farming/Fishing/Forestry       3 2010       24330
## 195               Construction       3 2010       43870
## 196 Installation/Repair/Maint.       3 2010       42810
## 197                 Production       3 2010       33770
## 198      Transportation/Moving       3 2010       32660
## 199                 Management       1 2011      107410
## 200        Business Operations       2 2011       68740
## 201           Computer Science       2 2011       78730
## 202   Architecture/Engineering       2 2011       77120
## 203  Life/Physical/Social Sci.       2 2011       67470
## 204         Community Services       3 2011       43830
## 205                      Legal       1 2011       98380
## 206 Education/Training/Library       3 2011       50870
## 207  Arts/Design/Entertainment       3 2011       53850
## 208   Healthcare Practitioners       2 2011       72730
## 209         Healthcare Support       3 2011       27370
## 210         Protective Service       3 2011       42730
## 211           Food Preparation       3 2011       21430
## 212  Grounds Cleaning & Maint.       3 2011       25560
## 213              Personal Care       3 2011       24620
## 214                      Sales       3 2011       37520
## 215      Office Administrative       3 2011       34120
## 216   Farming/Fishing/Forestry       3 2011       24300
## 217               Construction       3 2011       44630
## 218 Installation/Repair/Maint.       3 2011       43390
## 219                 Production       3 2011       34220
## 220      Transportation/Moving       3 2011       33200
## 221                 Management       1 2012      108570
## 222        Business Operations       2 2012       69550
## 223           Computer Science       2 2012       80180
## 224   Architecture/Engineering       2 2012       79000
## 225  Life/Physical/Social Sci.       2 2012       68360
## 226         Community Services       3 2012       44240
## 227                      Legal       1 2012       98570
## 228 Education/Training/Library       3 2012       51210
## 229  Arts/Design/Entertainment       3 2012       54490
## 230   Healthcare Practitioners       2 2012       73540
## 231         Healthcare Support       3 2012       27780
## 232         Protective Service       3 2012       43050
## 233           Food Preparation       3 2012       21380
## 234  Grounds Cleaning & Maint.       3 2012       25670
## 235              Personal Care       3 2012       24550
## 236                      Sales       3 2012       37990
## 237      Office Administrative       3 2012       34410
## 238   Farming/Fishing/Forestry       3 2012       24230
## 239               Construction       3 2012       44960
## 240 Installation/Repair/Maint.       3 2012       43870
## 241                 Production       3 2012       34500
## 242      Transportation/Moving       3 2012       33590
## 243                 Management       1 2013      110550
## 244        Business Operations       2 2013       71020
## 245           Computer Science       2 2013       82010
## 246   Architecture/Engineering       2 2013       80100
## 247  Life/Physical/Social Sci.       2 2013       69400
## 248         Community Services       3 2013       44710
## 249                      Legal       1 2013       99620
## 250 Education/Training/Library       3 2013       51500
## 251  Arts/Design/Entertainment       3 2013       55580
## 252   Healthcare Practitioners       2 2013       74740
## 253         Healthcare Support       3 2013       28300
## 254         Protective Service       3 2013       43510
## 255           Food Preparation       3 2013       21580
## 256  Grounds Cleaning & Maint.       3 2013       26010
## 257              Personal Care       3 2013       24710
## 258                      Sales       3 2013       38200
## 259      Office Administrative       3 2013       34900
## 260   Farming/Fishing/Forestry       3 2013       24330
## 261               Construction       3 2013       45630
## 262 Installation/Repair/Maint.       3 2013       44420
## 263                 Production       3 2013       34930
## 264      Transportation/Moving       3 2013       33860
## 265                 Management       1 2014      112490
## 266        Business Operations       2 2014       72410
## 267           Computer Science       2 2014       83970
## 268   Architecture/Engineering       2 2014       81520
## 269  Life/Physical/Social Sci.       2 2014       70070
## 270         Community Services       3 2014       45310
## 271                      Legal       1 2014      101110
## 272 Education/Training/Library       3 2014       52210
## 273  Arts/Design/Entertainment       3 2014       55790
## 274   Healthcare Practitioners       2 2014       76010
## 275         Healthcare Support       3 2014       28820
## 276         Protective Service       3 2014       43980
## 277           Food Preparation       3 2014       21980
## 278  Grounds Cleaning & Maint.       3 2014       26370
## 279              Personal Care       3 2014       24980
## 280                      Sales       3 2014       38660
## 281      Office Administrative       3 2014       35530
## 282   Farming/Fishing/Forestry       3 2014       25160
## 283               Construction       3 2014       46600
## 284 Installation/Repair/Maint.       3 2014       45220
## 285                 Production       3 2014       35490
## 286      Transportation/Moving       3 2014       34460
## 287                 Management       1 2015      115020
## 288        Business Operations       2 2015       73800
## 289           Computer Science       2 2015       86170
## 290   Architecture/Engineering       2 2015       82980
## 291  Life/Physical/Social Sci.       2 2015       71220
## 292         Community Services       3 2015       46160
## 293                      Legal       1 2015      103460
## 294 Education/Training/Library       3 2015       53000
## 295  Arts/Design/Entertainment       3 2015       56980
## 296   Healthcare Practitioners       2 2015       77800
## 297         Healthcare Support       3 2015       29520
## 298         Protective Service       3 2015       44610
## 299           Food Preparation       3 2015       22850
## 300  Grounds Cleaning & Maint.       3 2015       27080
## 301              Personal Care       3 2015       25650
## 302                      Sales       3 2015       39320
## 303      Office Administrative       3 2015       36330
## 304   Farming/Fishing/Forestry       3 2015       26360
## 305               Construction       3 2015       47580
## 306 Installation/Repair/Maint.       3 2015       45990
## 307                 Production       3 2015       36220
## 308      Transportation/Moving       3 2015       35160
## 309                 Management       1 2016      118020
## 310        Business Operations       2 2016       75070
## 311           Computer Science       2 2016       87880
## 312   Architecture/Engineering       2 2016       84300
## 313  Life/Physical/Social Sci.       2 2016       72930
## 314         Community Services       3 2016       47200
## 315                      Legal       1 2016      105980
## 316 Education/Training/Library       3 2016       54520
## 317  Arts/Design/Entertainment       3 2016       58390
## 318   Healthcare Practitioners       2 2016       79160
## 319         Healthcare Support       3 2016       30470
## 320         Protective Service       3 2016       45810
## 321           Food Preparation       3 2016       23850
## 322  Grounds Cleaning & Maint.       3 2016       28010
## 323              Personal Care       3 2016       26510
## 324                      Sales       3 2016       40560
## 325      Office Administrative       3 2016       37260
## 326   Farming/Fishing/Forestry       3 2016       27810
## 327               Construction       3 2016       48900
## 328 Installation/Repair/Maint.       3 2016       46690
## 329                 Production       3 2016       37190
## 330      Transportation/Moving       3 2016       36070

Great work! You now have the data frames necessary to explore the results of this clustering

Hierarchical clustering: Plotting occupational clusters

You have successfully created all the parts necessary to explore the results of this hierarchical clustering work. In this exercise you will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.

# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
##                 Management                      Legal 
##                          1                          1 
##        Business Operations           Computer Science 
##                          2                          2 
##   Architecture/Engineering  Life/Physical/Social Sci. 
##                          2                          2 
##   Healthcare Practitioners         Community Services 
##                          2                          3 
## Education/Training/Library  Arts/Design/Entertainment 
##                          3                          3 
##         Healthcare Support         Protective Service 
##                          3                          3 
##           Food Preparation  Grounds Cleaning & Maint. 
##                          3                          3 
##              Personal Care                      Sales 
##                          3                          3 
##      Office Administrative   Farming/Fishing/Forestry 
##                          3                          3 
##               Construction Installation/Repair/Maint. 
##                          3                          3 
##                 Production      Transportation/Moving 
##                          3                          3
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) + 
    geom_line(aes(group = occupation))

Cool huh! From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years. Let’s see what we can get by exploring this data using k-means.

Reviewing the HC results

Next steps: k-means clustering

K-means: Elbow analysis

In the previous exercises you used the dendrogram to propose a clustering that generated 3 trees. In this exercise you will leverage the k-means elbow plot to propose the “best” number of clusters.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

Fascinating! So the elbow analysis proposes a different value of k, in the next section let’s see what we can learn from Silhouette Width Analysis.

K-means: Average Silhouette Widths

So hierarchical clustering resulting in 3 clusters and the elbow method suggests 2. In this exercise use average silhouette widths to explore what the “best” value of k should be.

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

Great work! It seems that this analysis results in another value of k, this time 7 is the top contender (although 2 comes very close).

All of the above methods are correct but the best way to cluster is highly dependent on how you would use this data after.

There is no quantitative way to determine which of these clustering approaches is the right one without further exploration. You are done with the course! If you enjoyed the material, feel free to send Dmitriy a thank you via Twitter. He’ll appreciate it. Tweet to Dmitriy

Review K-means results

A lot more to learn